System And Method For Providing Document Classification Suggestions

ABSTRACT

A system and method for providing document classification suggestions is provided. Clusters of uncoded documents are obtained. One or more of the uncoded documents in one such cluster are compared to a set of reference documents. Each reference document is assigned with a classification code. Those reference documents that are similar to the one or more uncoded documents are identified. Different types of the classification codes for at least a portion of the similar reference documents are identified and a count of the classification codes assigned to the portion of similar reference documents for each classification code type is obtained. A suggestion for classification of at least one of the one or more uncoded documents is provided based on the count of classification codes for each classification type and one of a presence and absence of each classification code type.

CROSS-REFERENCE TO RELATED APPLICATION

This patent application is a continuation of commonly-assigned U.S. patent application Ser. No. 14/564,058, filed Dec. 8, 2014, pending, which is a continuation of U.S. Pat. No. 8,909,647, issued Dec. 9, 2014, which is a continuation of U.S. Pat. No. 8,515,957, issued Aug. 20, 2013, which claims priority under 35 U.S.C. §119(e) to U.S. Provisional Patent Application, Ser. No. 61/229,216, filed Jul. 28, 2009, and U.S. Provisional Patent Application, Ser. No. 61/236,490, filed Aug. 24, 2009, the priority dates of which are claimed and the disclosures of which are incorporated by reference.

FIELD

This application relates in general to using electronically stored information as a reference point and, in particular, to a system and method for providing document classification suggestions.

BACKGROUND

Historically, document review during the discovery phase of litigation and for other types of legal matters, such as due diligence and regulatory compliance, has been conducted manually. During document review, individual reviewers, generally licensed attorneys, are assigned sets of documents for coding. A reviewer must carefully study each document and categorize the document by assigning a code or other marker from a set of descriptive classifications, such as “privileged,” “responsive,” and “non-responsive.” The classifications affect the disposition of each document, including admissibility into evidence. During discovery, document review can potentially affect the outcome of the underlying legal matter, so consistent and accurate results are crucial.

Manual document review is tedious and time-consuming. Marking documents is solely at the discretion of each reviewer and inconsistent results may occur due to misunderstanding, time pressures, fatigue, or other factors. A large volume of documents reviewed, often with only limited time, can create a loss of mental focus and a loss of purpose for the resultant classification. Each new reviewer also faces a steep learning curve to become familiar with the legal matter, coding categories, and review techniques.

Currently, with the increasingly widespread movement to electronically stored information (ESI), manual document review is no longer practicable. The often exponential growth of ESI exceeds the bounds reasonable for conventional manual human review and underscores the need for computer-assisted ESI review tools.

Conventional ESI review tools have proven inadequate for providing efficient, accurate, and consistent results. For example, DiscoverReady LLC, a Delaware limited liability company, conducts semi-automated document review through multiple passes over a document set in ESI form. During the first pass, documents are grouped by category and basic codes are assigned. Subsequent passes refine and further assign codings. Multiple pass review also requires a priori project-specific knowledge engineering, which is useful for only the single project, thereby losing the benefit of any inferred knowledge or know-how for use in other review projects.

Thus, there remains a need for a system and method for increasing the efficiency of document review that bootstraps knowledge gained from other reviews while ultimately ensuring independent reviewer discretion.

SUMMARY

Document review efficiency can be increased by identifying relationships between reference ESI and uncoded ESI and providing a suggestion for classification based on the relationships. A set of clusters including uncoded ESI is obtained. The uncoded ESI for a cluster are compared to a set of reference ESI. Those reference ESI most similar to the uncoded ESI are identified and inserted into the cluster. The relationships between the inserted reference ESI and uncoded ESI for the cluster are visually depicted and provide a suggestion regarding classification of the uncoded ESI.

An embodiment provides a system and method for providing document classification suggestions. Clusters of uncoded documents are obtained. One or more of the uncoded documents in one such cluster are compared to a set of reference documents. Each reference document is assigned a classification code. Those reference documents that are similar to the one or more uncoded documents are identified. Different types of the classification codes for at least a portion of the similar reference documents are identified and a count of the classification codes assigned to the portion of similar reference documents for each classification code type is obtained. A suggestion for classification of at least one of the one or more uncoded documents is provided based on the count of classification codes for each classification type and one of a presence and absence of each classification code type.
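By way of illustration only, the counting step described above can be sketched as follows. The function name `suggest_code`, the sample codes, and the rule of suggesting the most frequent code are assumptions made for this sketch; the description states only that the suggestion is based on the per-type counts and on the presence or absence of each code type.

```python
from collections import Counter

def suggest_code(similar_reference_codes, code_types):
    """Count codes per type among similar reference documents.

    Returns (suggestion, counts, presence). Picking the most frequent
    code as the suggestion is an assumed rule, not one from the text.
    """
    counts = Counter(similar_reference_codes)
    # Record presence or absence of every known classification code type.
    presence = {code: counts.get(code, 0) > 0 for code in code_types}
    if not counts:
        return None, dict(counts), presence
    suggestion = counts.most_common(1)[0][0]
    return suggestion, dict(counts), presence

# Hypothetical codes assigned to the similar reference documents.
codes = ["privileged", "privileged", "responsive"]
types = ["privileged", "responsive", "non-responsive"]
suggestion, counts, presence = suggest_code(codes, types)
```

Here two of the three similar reference documents are coded “privileged,” so that code is suggested, while the absence of any “non-responsive” reference document is recorded separately.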

Still other embodiments of the present invention will become readily apparent to those skilled in the art from the following detailed description, wherein are described embodiments by way of illustrating the best mode contemplated for carrying out the invention. As will be realized, the invention is capable of other and different embodiments and its several details are capable of modifications in various obvious respects, all without departing from the spirit and the scope of the present invention. Accordingly, the drawings and detailed description are to be regarded as illustrative in nature and not as restrictive.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a block diagram showing a system for displaying relationships between ESI to provide classification suggestions via injection, in accordance with one embodiment.

FIG. 2 is a process flow diagram showing a method for displaying relationships between ESI to provide classification suggestions via injection, in accordance with one embodiment.

FIG. 3 is a process flow diagram showing, by way of example, a method for forming clusters for use in the method of FIG. 2.

FIG. 4 is a block diagram showing, by way of example, cluster measures for comparing uncoded documents with and identifying similar reference documents for use in the method of FIG. 2.

FIG. 5 is a screenshot showing, by way of example, a visual display of reference documents in relation to uncoded documents.

FIG. 6A is a block diagram showing, by way of example, a cluster with “privileged” reference documents and uncoded documents.

FIG. 6B is a block diagram showing, by way of example, a cluster 96 with “non-responsive” reference documents 97 and uncoded documents 94.

FIG. 6C is a block diagram showing, by way of example, a cluster 98 with a combination of classified reference documents and uncoded documents 94.

FIG. 7 is a process flow diagram showing, by way of example, a method for classifying uncoded documents for use in the method of FIG. 2 using a classifier.

FIG. 8 is a screenshot showing, by way of example, a reference options dialogue box for entering user preferences for reference document injection.

DETAILED DESCRIPTION

The ever-increasing volume of ESI underlies the need for automating document review for improved consistency and throughput. Previously classified ESI offer knowledge gleaned from earlier work in similar legal projects, as well as a reference point for classifying uncoded ESI.

Providing Suggestions Using Reference Documents

Reference ESI is previously classified by content and can be injected into clusters of uncoded, that is unclassified, ESI to influence classification of the uncoded ESI. Specifically, relationships between an uncoded ESI and the reference ESI in terms of semantic similarity or distinction can be used as an aid in providing suggestions for classifying uncoded ESI.

Complete ESI review requires a support environment within which classification can be performed. FIG. 1 is a block diagram showing a system 10 for displaying relationships between ESI to provide classification suggestions via injection. By way of illustration, the system 10 operates in a distributed computing environment, which includes a plurality of heterogeneous systems and ESI sources. Henceforth, a single item of ESI will be referenced as a “document,” although ESI can include other forms of non-document data, as described infra. A backend server 11 is coupled to a storage device 13, which stores documents 14 a, such as uncoded documents in the form of structured or unstructured data, a database 30 for maintaining information about the documents, and a lookup database 38 for storing many-to-many mappings 39 between documents and document features, such as concepts. The storage device 13 also stores reference documents 14 b, which provide a training set of trusted and known results for use in guiding ESI classification. The reference documents 14 b are each associated with an assigned classification code and considered as classified or coded. Hereinafter, the terms “classified” and “coded” are used interchangeably with the same intended meaning, unless otherwise indicated. A set of reference documents can be hand-selected or automatically selected through guided review, which is further discussed below. Additionally, the set of reference documents can be predetermined or can be generated dynamically, as uncoded documents are classified and subsequently added to the set of reference documents.

The backend server 11 is coupled to an intranetwork 21 and executes a workbench software suite 31 for providing a user interface framework for automated document management, processing, analysis, and classification. In a further embodiment, the backend server 11 can be accessed via an internetwork 22. The workbench software suite 31 includes a document mapper 32 that includes a clustering engine 33, similarity searcher 34, classifier 35, and display generator 36. Other workbench suite modules are possible.

The clustering engine 33 performs efficient document scoring and clustering of uncoded documents, such as described in commonly-assigned U.S. Pat. No. 7,610,313, the disclosure of which is incorporated by reference. Clusters of uncoded documents 14 a can be organized along vectors, known as spines, based on a similarity of the clusters. The similarity can be expressed in terms of distance. Document clustering is further discussed below with reference to FIG. 3. The similarity searcher 34 identifies the reference documents 14 b that are most similar to selected uncoded documents 14 a, clusters, or spines, which is further described below with reference to FIG. 4. The classifier 35 provides a machine-generated suggestion and confidence level for classification of the selected uncoded document 14 a, cluster, or spine, as further described with reference to FIG. 7. The display generator 36 arranges the clusters and spines in thematic relationships in a two-dimensional visual display space and inserts the identified reference documents into one or more of the clusters, as further described below beginning with reference to FIG. 2. Once generated, the visual display space is transmitted to a work client 12 by the backend server 11 via the document mapper 32 for presenting to a reviewer on a display 37. The reviewer can include an individual person who is assigned to review and classify one or more uncoded documents by designating a code. Hereinafter, unless otherwise indicated, the terms “reviewer” and “custodian” are used interchangeably with the same intended meaning. Other types of reviewers are possible, including machine-implemented reviewers.

The document mapper 32 operates on uncoded documents 14 a, which can be retrieved from the storage device 13, as well as a plurality of local and remote sources. The local and remote sources can also store the reference documents 14 b. The local sources include documents 17 maintained in a storage device 16 coupled to a local server 15 and documents 20 maintained in a storage device 19 coupled to a local client 18. The local server 15 and local client 18 are interconnected to the backend server 11 and the work client 12 over the intranetwork 21. In addition, the document mapper 32 can identify and retrieve documents from remote sources over the internetwork 22, including the Internet, through a gateway 23 interfaced to the intranetwork 21. The remote sources include documents 26 maintained in a storage device 25 coupled to a remote server 24 and documents 29 maintained in a storage device 28 coupled to a remote client 27. Other document sources, either local or remote, are possible.

The individual documents 14 a, 14 b, 17, 20, 26, 29 include all forms and types of structured and unstructured ESI, including electronic message stores, word processing documents, electronic mail (email) folders, Web pages, and graphical or multimedia data. Notwithstanding, the documents could be in the form of structurally organized data, such as stored in a spreadsheet or database.

In one embodiment, the individual documents 14 a, 14 b, 17, 20, 26, 29 can include electronic message folders storing email and attachments, such as maintained by the Outlook and Outlook Express products, licensed by Microsoft Corporation, Redmond, Wash. The database can be an SQL-based relational database, such as the Oracle database management system, release 8, licensed by Oracle Corporation, Redwood Shores, Calif.

The individual documents can be designated and stored as uncoded documents or reference documents. The reference documents are initially uncoded documents that can be selected from the corpus or other source of uncoded documents and subsequently classified. The reference documents assist in providing suggestions for classification of the remaining uncoded documents in the corpus based on visual relationships between the uncoded documents and reference documents. The reviewer can classify one or more of the remaining uncoded documents by assigning a classification code based on the relationships. In a further embodiment, the reference documents can be used as a training set to form machine-generated suggestions for classifying the remaining uncoded documents, as further described below with reference to FIG. 7.

The reference documents are representative of the document corpus for a review project in which data organization or classification is desired, or of a subset of the document corpus. A set of reference documents can be generated for each document review project or, alternatively, the reference documents can be selected from a previously conducted document review project that is related to the current document review project. Guided review assists a reviewer in building a reference document set representative of the corpus for use in classifying uncoded documents. During guided review, uncoded documents that are dissimilar to all other uncoded documents in the corpus are identified based on a similarity threshold. Other methods for determining dissimilarity are possible. Identifying the dissimilar documents provides a group of uncoded documents that is representative of the corpus for a document review project. Each identified dissimilar document is then classified by assigning a particular classification code based on the content of the document to generate a set of reference documents for the document review project. Guided review can be performed by a reviewer, a machine, or a combination of the reviewer and machine.
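By way of illustration only, the identification of dissimilar documents during guided review can be sketched as follows. The sketch assumes that each document is represented by a sparse score vector (a mapping of token to weight) and that cosine similarity is the similarity measure; the names `cosine` and `dissimilar_documents`, the sample vectors, and the threshold value are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def dissimilar_documents(vectors, threshold):
    """Return ids of documents whose similarity to every other
    document falls below the threshold."""
    picked = []
    for doc_id, vec in vectors.items():
        others = (cosine(vec, v) for other_id, v in vectors.items()
                  if other_id != doc_id)
        if all(s < threshold for s in others):
            picked.append(doc_id)
    return picked

# Hypothetical corpus: d1 and d2 overlap heavily, d3 shares no tokens.
vectors = {
    "d1": {"contract": 1.0, "merger": 0.5},
    "d2": {"contract": 0.9, "merger": 0.6},
    "d3": {"picnic": 1.0},
}
candidates = dissimilar_documents(vectors, threshold=0.5)
```

In this example only `d3` is dissimilar to all other documents and would be put forward for classification as a reference document.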

Other methods for generating a reference document set for a document review project using guided review are possible, including clustering. For example, a set of uncoded documents to be classified is clustered, as described in commonly-assigned U.S. Pat. No. 7,610,313, the disclosure of which is incorporated by reference. A plurality of the clustered uncoded documents are selected based on selection criteria, such as cluster centers or sample clusters. The cluster centers can be used to identify uncoded documents in a cluster that are most similar or dissimilar to the cluster center. The identified uncoded documents are then selected for classification by assigning codes. After classification, the previously uncoded documents represent a reference set. In a further example, sample clusters can be used to generate a reference set by selecting one or more sample clusters based on cluster relation criteria, such as size, content, similarity, or dissimilarity. The uncoded documents in the selected sample clusters are then assigned classification codes. The classified documents represent a reference document set for the document review project. Other methods for selecting uncoded documents for use as a reference set are possible.

The document corpus for a document review project can be divided into subsets of uncoded documents, which are each provided as an assignment to a particular reviewer. To maintain consistency, the same classification codes can be used across all assignments in the document review project. The classification codes can be determined using taxonomy generation, during which a list of classification codes can be provided by a reviewer or determined automatically. For purposes of legal discovery, the classification codes used to classify uncoded documents can include “privileged,” “responsive,” or “non-responsive.” Other codes are possible. A “privileged” document contains information that is protected by a privilege, meaning that the document should not be disclosed to an opposing party. Disclosing a “privileged” document can result in an unintentional waiver of the privilege with respect to the subject matter. A “responsive” document contains information that is related to a legal matter on which the document review project is based and a “non-responsive” document includes information that is not related to the legal matter.

Utilizing reference documents to assist in classifying uncoded documents, clusters, or spines can be performed by the system 10, which includes individual computer systems, such as the backend server 11, work client 12, server 15, client 18, remote server 24 and remote client 27. The individual computer systems are general purpose, programmed digital computing devices consisting of a central processing unit (CPU), random access memory (RAM), non-volatile secondary storage, such as a hard drive or CD ROM drive, network interfaces, and peripheral devices, including user interfacing means, such as a keyboard and display. The various implementations of the source code and object and byte codes can be held on a computer-readable storage medium, such as a floppy disk, hard drive, digital video disk (DVD), random access memory (RAM), read-only memory (ROM) and similar storage mediums. For example, program code, including software programs, and data are loaded into the RAM for execution and processing by the CPU and results are generated for display, output, transmittal, or storage.

Identifying the reference documents for use as classification suggestions includes a comparison of the uncoded documents and the reference documents. FIG. 2 is a process flow diagram showing a method 40 for displaying relationships between ESI to provide classification suggestions via injection. A set of clusters of uncoded documents is obtained (block 41). For each cluster, a cluster center is determined based on the uncoded documents included in that cluster. The clusters can be generated upon command or previously generated and stored. Clustering uncoded documents is further discussed below with reference to FIG. 3. One or more uncoded documents can be compared with a set of reference documents (block 42) and those reference documents that satisfy a threshold of similarity are selected (block 43). Determining similar reference documents is further discussed below with reference to FIG. 4. The selected reference documents are then injected into the cluster associated with the one or more uncoded documents (block 44). The selected reference documents injected into the cluster can be the same as or different than the selected reference documents injected into another cluster. Because the same reference document can be injected into more than one cluster, the total number of reference documents and uncoded documents in the clusters can exceed the sum of the uncoded documents originally clustered and the reference document set. In a further embodiment, a single uncoded document or spine can be compared to the reference document set to identify similar reference documents for injecting into the cluster set.
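By way of illustration only, the comparison, selection, and injection steps (blocks 42-44) can be sketched as follows, assuming that each cluster is represented by its cluster center score vector and that cosine similarity is used. The names `inject_references`, `centers`, and `references`, and all sample values, are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def inject_references(cluster_centers, references, threshold):
    """Map each cluster id to the ids of the reference documents that
    satisfy the similarity threshold for that cluster."""
    injected = {}
    for cluster_id, center in cluster_centers.items():
        injected[cluster_id] = [
            ref_id for ref_id, ref in references.items()
            if cosine(center, ref["vector"]) >= threshold
        ]
    return injected

# Hypothetical cluster centers and coded reference documents.
centers = {"c1": {"merger": 0.8, "contract": 0.6}, "c2": {"lunch": 1.0}}
references = {
    "r1": {"vector": {"merger": 1.0}, "code": "responsive"},
    "r2": {"vector": {"lunch": 0.7, "menu": 0.7}, "code": "non-responsive"},
}
injected = inject_references(centers, references, threshold=0.5)
```

Each cluster is evaluated independently, so a reference document that satisfies the threshold for several clusters would appear in each of them.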

Together, reference documents injected into the clusters represent a subset of reference documents specific to that cluster set. The clusters of uncoded documents and inserted reference documents can be displayed to visually depict relationships (block 45) between the uncoded documents in the cluster and the inserted reference documents. The relationships can provide a suggestion for use by an individual reviewer for classifying that cluster. Determining relationships between the reference documents and uncoded documents to identify classification suggestions is further discussed below with reference to FIGS. 6A-6C. Further, machine classification can optionally provide a classification suggestion based on a calculated confidence level (block 46). Machine-generated classification suggestions and confidence levels are further discussed below with reference to FIG. 7. The above process has been described with reference to documents; however, other objects or tokens are possible.

Obtaining Clusters

The corpus of uncoded documents for a review project can be divided into assignments using assignment criteria, such as custodian or source of the uncoded documents, content, document type, and date. Other criteria are possible. Each assignment is assigned to an individual reviewer for analysis. The assignments can be separately clustered or alternatively, all of the uncoded documents in the document corpus can be clustered together. The content of each uncoded document within the corpus can be converted into a set of tokens, which are word-level or character-level n-grams, raw terms, concepts, or entities. Other tokens are possible.

An n-gram is a contiguous sequence of a predetermined number of items selected from a source. The items can include syllables, letters, or words, as well as other items. A raw term is a term that has not been processed or manipulated. Concepts typically include nouns and noun phrases obtained through part-of-speech tagging that have a common semantic meaning. Entities further refine nouns and noun phrases into people, places, and things, such as meetings, animals, relationships, and various other objects. Entities can be extracted using entity extraction techniques known in the field. Clustering of the uncoded documents can be based on cluster criteria, such as the similarity of tokens, including n-grams, raw terms, concepts, entities, email addresses, or other metadata.
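By way of illustration only, word-level and character-level n-gram tokens can be generated as sketched below. The function names and the whitespace-based word splitting are assumptions for this sketch; the description does not prescribe a particular tokenizer.

```python
def word_ngrams(text, n):
    """Return word-level n-grams as tuples of n consecutive lowercased words."""
    words = text.lower().split()
    return [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]

def char_ngrams(text, n):
    """Return character-level n-grams as substrings of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

bigrams = word_ngrams("Please review the attached contract", 2)
trigrams = char_ngrams("review", 3)
```

A five-word sentence yields four word-level bigrams, and the six-letter word “review” yields four character-level trigrams.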

Clustering provides groupings of related uncoded documents. FIG. 3 is a flow diagram showing a routine 50 for forming clusters for use in the method 40 of FIG. 2. The purpose of this routine is to use score vectors associated with each uncoded document to form clusters based on relative similarity. The score vector for each uncoded document includes a set of paired values for tokens identified in that document and weights. The score vector is generated by scoring the tokens extracted from each uncoded document, as described in commonly-assigned U.S. Pat. No. 7,610,313, the disclosure of which is incorporated by reference.

As an initial step for generating score vectors, each token for an uncoded document is individually scored. Next, a normalized score vector is created for the uncoded document by identifying paired values, consisting of a token occurring in that document and the score for that token. The paired values are ordered along a vector to generate the score vector. The paired values can be ordered based on tokens, including concepts or frequency, as well as other factors. For example, assume a normalized score vector for a first uncoded document A is $\vec{S}_{A}$ = {(5, 0.5), (120, 0.75)} and a normalized score vector for another uncoded document B is $\vec{S}_{B}$ = {(3, 0.4), (5, 0.75), (47, 0.15)}. Document A has scores corresponding to tokens ‘5’ and ‘120’ and Document B has scores corresponding to tokens ‘3,’ ‘5’ and ‘47.’ Thus, these uncoded documents only have token ‘5’ in common. Once generated, the score vectors can be compared to determine similarity or dissimilarity between the corresponding uncoded documents during clustering.
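By way of illustration only, the two example score vectors can be held as mappings from token identifier to score, from which the shared tokens follow directly. Representing a score vector as a Python dict and ordering the paired values by token identifier are assumptions for this sketch.

```python
# Paired values (token, score) taken from the example for documents A and B.
score_a = {5: 0.5, 120: 0.75}
score_b = {3: 0.4, 5: 0.75, 47: 0.15}

def ordered_pairs(score_vector):
    """Return the paired values ordered by token identifier."""
    return sorted(score_vector.items())

# Tokens that occur in both documents.
shared_tokens = sorted(set(score_a) & set(score_b))
```

Only token 5 appears in `shared_tokens`, which matches the observation that the two documents have a single token in common.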

The routine for forming clusters proceeds in two phases. During the first phase (blocks 53-58), uncoded documents are evaluated to identify a set of seed documents, which can be used to form new clusters. During the second phase (blocks 60-66), the uncoded documents not previously placed are evaluated and grouped into existing clusters based on a best-fit criterion.

Initially, a single cluster is generated with one or more uncoded documents as seed documents and additional clusters of uncoded documents are added. Each cluster is represented by a cluster center that is associated with a score vector, which is representative of the tokens in all the documents for that cluster. In the following discussion relating to FIG. 3, the tokens include concepts. However, other tokens are possible, as described above. The cluster center score vector can be generated by comparing the score vectors for the individual uncoded documents in the cluster and identifying the most common concepts shared by the uncoded documents. The most common concepts and the associated weights are ordered along the cluster center score vector. Cluster centers, and thus, cluster center score vectors may continually change due to the addition and removal of documents during clustering.
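By way of illustration only, a cluster center score vector can be built from the most common concepts as sketched below. The cutoff `top_k`, the use of document frequency to rank concepts, and the averaging of weights are assumptions for this sketch; the description states only that the most common concepts and their associated weights form the cluster center score vector.

```python
from collections import Counter

def cluster_center(doc_vectors, top_k):
    """Build a center vector from the top_k concepts that occur in the
    most documents, weighting each by its mean score (an assumed rule)."""
    doc_freq = Counter(c for vec in doc_vectors for c in vec)
    # Rank by descending document frequency, then by concept id so the
    # result is deterministic when frequencies tie.
    common = sorted(doc_freq, key=lambda c: (-doc_freq[c], c))[:top_k]
    center = {}
    for concept in common:
        weights = [vec[concept] for vec in doc_vectors if concept in vec]
        center[concept] = sum(weights) / len(weights)
    return center

# Hypothetical score vectors for three documents in one cluster.
docs = [{5: 0.5, 120: 0.75}, {3: 0.4, 5: 0.75, 47: 0.15}, {5: 0.25, 47: 0.45}]
center = cluster_center(docs, top_k=2)
```

Concept 5 occurs in all three documents and concept 47 in two, so those two concepts make up the center. Recomputing the center whenever a document is added or removed reflects the continual change noted above.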

During clustering, the uncoded documents are identified (block 51) and ordered by length (block 52). The uncoded documents can include all uncoded documents in a corpus or can include only those uncoded documents for a single assignment. Each uncoded document is then processed in an iterative processing loop (blocks 53-58) as follows. The similarity between each uncoded document and the cluster centers, based on uncoded documents already clustered, is determined (block 54) as the cosine (cos) σ of the score vectors for the uncoded documents and cluster being compared. The cos σ provides a measure of relative similarity or dissimilarity between tokens, including the concepts, in the uncoded documents and, for normalized score vectors, is equivalent to the inner product between the score vectors for the uncoded document and cluster center.

In the described embodiment, the cos σ is calculated in accordance with the equation:

$\cos \sigma_{AB} = \frac{\langle \vec{S}_{A} \cdot \vec{S}_{B} \rangle}{\left| \vec{S}_{A} \right| \left| \vec{S}_{B} \right|}$

where $\cos \sigma_{AB}$ comprises the similarity metric between uncoded document A and cluster center B, $\vec{S}_{A}$ comprises a score vector for the uncoded document A, and $\vec{S}_{B}$ comprises a score vector for the cluster center B. Other forms of determining similarity using a distance metric are feasible, as would be recognized by one skilled in the art. An example includes using Euclidean distance.
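By way of illustration only, the equation can be evaluated over sparse score vectors as sketched below. The function name `cosine_similarity` and the dict representation are assumptions; the sample values reuse the example vectors given earlier for documents A and B, with the second vector standing in for a cluster center.

```python
import math

def cosine_similarity(s_a, s_b):
    """Inner product of two sparse score vectors divided by the
    product of their magnitudes."""
    dot = sum(w * s_b[t] for t, w in s_a.items() if t in s_b)
    norm_a = math.sqrt(sum(w * w for w in s_a.values()))
    norm_b = math.sqrt(sum(w * w for w in s_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat an empty vector as having no similarity
    return dot / (norm_a * norm_b)

s_a = {5: 0.5, 120: 0.75}
s_b = {3: 0.4, 5: 0.75, 47: 0.15}
similarity = cosine_similarity(s_a, s_b)
```

Only token 5 contributes to the inner product (0.5 × 0.75 = 0.375), so the resulting similarity is roughly 0.48.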

Only those uncoded documents that are sufficiently distinct from all cluster centers (block 55) are selected as seed documents for forming new clusters (block 56). If the uncoded documents being compared are not sufficiently distinct (block 55), each uncoded document is then grouped into a cluster with the most similar cluster center (block 57). Processing continues with the next uncoded document (block 58).
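By way of illustration only, the first phase can be sketched as a single pass over the documents. The parameter `seed_threshold` and the simplification of keeping each cluster center fixed at its seed document's score vector are assumptions for this sketch; in the routine described, cluster centers change as documents are added. The caller is assumed to supply the documents already ordered by length.

```python
import math

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def first_phase(doc_vectors, seed_threshold):
    """Seed a new cluster for each document that is sufficiently distinct
    from all existing centers; otherwise join the most similar cluster."""
    clusters = []
    for doc_id, vec in doc_vectors.items():
        sims = [cosine(vec, c["center"]) for c in clusters]
        if not sims or max(sims) < seed_threshold:
            # Sufficiently distinct from every center: new seed document.
            clusters.append({"center": dict(vec), "members": [doc_id]})
        else:
            clusters[sims.index(max(sims))]["members"].append(doc_id)
    return clusters

# Hypothetical documents: "b" resembles "a", while "c" shares no tokens.
docs = {"a": {1: 1.0}, "b": {1: 0.9, 2: 0.1}, "c": {3: 1.0}}
clusters = first_phase(docs, seed_threshold=0.5)
```

Documents “a” and “c” become seed documents for two clusters, and “b” is grouped with “a.”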

In the second phase, each uncoded document not previously placed is processed in an iterative processing loop (blocks 60-66) as follows. Again, the similarity between each remaining uncoded document and each cluster center is determined based on a distance (block 61) as the cos σ of the normalized score vectors for the remaining uncoded document and the cluster center. A best fit between the remaining uncoded document and one of the cluster centers can be found subject to a minimum fit criterion (block 62). In the described embodiment, a minimum fit criterion of 0.25 is used, although other minimum fit criteria could be used. If a best fit is found (block 63), the remaining uncoded document is grouped into the cluster having the best fit (block 65). Otherwise, the remaining uncoded document is grouped into a miscellaneous cluster (block 64). Processing continues with the next remaining uncoded document (block 66). Finally, a dynamic threshold can be applied to each cluster (block 67) to evaluate and strengthen document membership in a particular cluster. The dynamic threshold is applied on a cluster-by-cluster basis, as described in commonly-assigned U.S. Pat. No. 7,610,313, the disclosure of which is incorporated by reference. The routine then returns. Other methods and processes for forming clusters are possible.
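By way of illustration only, the best-fit test of the second phase can be sketched as follows, using the minimum fit criterion of 0.25 from the described embodiment. The function name `best_fit`, the label `"miscellaneous"`, and the sample cluster centers are hypothetical, and the dynamic threshold of block 67 is not modeled.

```python
import math

MIN_FIT = 0.25  # minimum fit criterion of the described embodiment

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def best_fit(vec, centers, min_fit=MIN_FIT):
    """Return the id of the best-fitting cluster, or 'miscellaneous'
    when no cluster center meets the minimum fit criterion."""
    best_id, best_sim = "miscellaneous", min_fit
    for cluster_id, center in centers.items():
        sim = cosine(vec, center)
        if sim >= best_sim:
            best_id, best_sim = cluster_id, sim
    return best_id

# Hypothetical cluster centers.
centers = {"c1": {1: 1.0}, "c2": {2: 1.0}}
placed = best_fit({1: 0.9, 2: 0.2}, centers)
unplaced = best_fit({3: 1.0}, centers)
```

The first document fits cluster `c1` well above the criterion, while the second shares no tokens with any center and falls into the miscellaneous cluster.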

Identifying Similar Reference Documents

Once a cluster set is obtained, one or more uncoded documents associated with a cluster are compared to a set of reference documents to identify a subset of the reference documents that are similar. The similarity is determined based on a similarity metric, which can include a distance metric. The similarity metric can be determined as the cos σ of the score vectors for the reference documents and clusters associated with the one or more uncoded documents. The one or more uncoded documents can be selected based on a cluster measure. FIG. 4 is a block diagram showing, by way of example, cluster measures 70 for comparing uncoded documents with and identifying similar reference documents for use in the method of FIG. 2. One or more uncoded documents in at least one cluster are compared with the reference documents to identify a subset of the reference documents that are similar. More specifically, the cluster of the one or more uncoded documents can be represented by a cluster measure, which is compared with the reference documents. The cluster measures 70 can include a cluster center 71, sample 72, cluster center and sample 73, and spine 74. Once compared, a similarity threshold is applied to the reference documents to identify those reference documents that are most similar.

Identifying similar reference documents using the cluster center measure 71 includes determining a cluster center for each cluster, comparing one or more of the cluster centers to a set of reference documents, and identifying the reference documents that satisfy a threshold similarity with the particular cluster center. More specifically, the score vector for the cluster center is compared to score vectors associated with each reference document as cos σ of the score vectors for the reference document and the cluster center. The score vector for the cluster is based on the cluster center, which considers the score vectors for all the uncoded documents in that cluster. The sample cluster measure 72 includes generating a sample of one or more uncoded documents in a single cluster that is representative of that cluster. The number of uncoded documents in the sample can be defined by the reviewer, set as a default, or determined automatically. Once generated, a score vector is calculated for the sample by comparing the score vectors for the individual uncoded documents selected for inclusion in the sample and identifying the most common concepts shared by the selected documents. The most common concepts and associated weights for the samples are positioned along a score vector, which is representative of the sample of uncoded documents for the cluster. The cluster center and sample cluster measure 73 includes comparing both the cluster center score vector and the sample score vector for a cluster to identify reference documents that are similar to the uncoded documents in that cluster.
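By way of illustration only, the sample measure and the combined cluster center and sample measure can be sketched as follows. Random sampling, the cutoff `top_k`, mean weighting, and treating a reference document as similar when either comparison meets the threshold are all assumptions for this sketch; the description does not specify how the two comparisons are combined.

```python
import math
import random
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def common_concept_vector(vectors, top_k):
    """Score vector of the top_k most widely shared concepts, mean-weighted."""
    freq = Counter(c for v in vectors for c in v)
    top = sorted(freq, key=lambda c: (-freq[c], c))[:top_k]
    return {c: sum(v[c] for v in vectors if c in v) / freq[c] for c in top}

def sample_vector(cluster_docs, sample_size, top_k, seed=0):
    """Draw a sample of documents from one cluster and summarize it."""
    rng = random.Random(seed)
    sample = rng.sample(cluster_docs, min(sample_size, len(cluster_docs)))
    return common_concept_vector(sample, top_k)

def similar_by_center_and_sample(center, sample_vec, references, threshold):
    """Keep references similar to the center or to the sample (assumed rule)."""
    return [ref_id for ref_id, vec in references.items()
            if max(cosine(center, vec), cosine(sample_vec, vec)) >= threshold]

# Hypothetical cluster documents, cluster center, and reference vectors.
cluster_docs = [{1: 0.8, 2: 0.2}, {1: 0.6, 3: 0.4}]
sample_vec = sample_vector(cluster_docs, sample_size=2, top_k=1)
similar = similar_by_center_and_sample(
    {1: 1.0}, sample_vec, {"r1": {1: 1.0}, "r2": {9: 1.0}}, threshold=0.5)
```

With both documents sampled, concept 1 is the only concept they share, so the sample vector reduces to that concept and only reference `r1` is selected.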

Further, similar reference documents can be identified based on a spine, which includes those clusters that share one or more tokens, such as concepts, and are arranged linearly along a vector. The cluster spines are generated as described in commonly-assigned U.S. Pat. No. 7,271,804, the disclosure of which is incorporated by reference. Also, the cluster spines can be positioned in relation to other cluster spines, as described in commonly-assigned U.S. Pat. No. 7,610,313, issued Oct. 27, 2009, the disclosure of which is incorporated by reference. Organizing the clusters into spines and groups of cluster spines provides an individual reviewer with a display that presents the uncoded documents and reference documents according to theme while maximizing the number of relationships depicted between the documents. Each theme can include one or more concepts defining a semantic meaning.

The spine cluster measure 74 involves generating a score vector for a spine by comparing the score vectors for the clusters positioned along that spine and identifying the most common concepts shared by the clusters. The most common concepts and associated scores are positioned along a vector to form a spine score vector. The spine score vector is compared with the score vectors of the reference documents in the set to identify similar reference documents.

The measure of similarity determined between the reference documents and selected uncoded documents can be calculated as cos σ of the corresponding score vectors. However, other similarity calculations are possible. The similarity calculations can be applied to a threshold and those reference documents that satisfy the threshold can be selected as the most similar. The most similar reference documents selected for a cluster can be the same or different from the most similar reference documents for the other clusters. Although four types of similarity metrics are described above, other similarity metrics are possible.
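The cosine comparison and threshold selection described above can be sketched, as a non-limiting example, in the following manner. Score vectors are assumed here to be sparse mappings of concepts to weights, and the function names cosine_similarity and similar_references, the reference identifiers, and the threshold value are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse score vectors (concept -> weight)."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def similar_references(measure, references, threshold):
    """Return (ref_id, similarity) pairs at or above threshold, most similar first."""
    scored = [(rid, cosine_similarity(measure, vec)) for rid, vec in references.items()]
    kept = [(rid, s) for rid, s in scored if s >= threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


# A hypothetical cluster measure compared against three reference documents.
measure = {"contract": 1.0, "merger": 1.0}
references = {
    "r1": {"contract": 1.0, "merger": 1.0},
    "r2": {"lunch": 1.0},
    "r3": {"contract": 1.0},
}
selected = similar_references(measure, references, threshold=0.5)
```

In this example, the reference documents r1 and r3 satisfy the threshold, while r2 shares no concepts with the cluster measure and is not selected.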

Upon identification, the similar reference documents for a cluster are injected into that cluster to provide relationships between the similar reference documents and uncoded documents. Identifying the most similar reference documents and injecting those documents can occur cluster-by-cluster or for all the clusters simultaneously. The number of similar reference documents selected for injection can be defined by the reviewer, set as a default, or determined automatically. Other determinations for the number of similar reference documents are possible. The similar reference documents can provide hints or suggestions to a reviewer regarding how to classify the uncoded documents based on the relationships.

Displaying the Reference Documents

The clusters of uncoded documents and inserted reference documents can be provided as a display to the reviewer. FIG. 5 is a screenshot 80 showing, by way of example, a visual display 81 of reference documents 85 in relation to uncoded documents 84. Clusters 83 can be located along a spine, which is a straight vector, based on a similarity of the uncoded documents in the clusters 83. Each cluster 83 is represented by a circle; however, other shapes, such as squares, rectangles, and triangles are possible, as described in U.S. Pat. No. 6,888,548, the disclosure of which is incorporated by reference. The uncoded documents 84 are each represented by a smaller circle within the clusters 83, while the reference documents 85 are each represented by a circle with a diamond within the boundaries of the circle. The reference documents 85 can be further represented by their assigned classification code. Classification codes can include “privileged,” “responsive,” and “non-responsive,” as well as other codes. Each group of reference documents associated with a particular classification code can be identified by a different color. For instance, “privileged” reference documents can be colored blue, while “non-responsive” reference documents are red and “responsive” reference documents are green. In a further embodiment, the reference documents for different classification codes can include different symbols. For example, “privileged” reference documents can be represented by a circle with an “X” in the center, while “non-responsive” reference documents can include a circle with striped lines and “responsive” reference documents can include a circle with dashed lines. Other classification representations for the reference documents are possible.

The display 81 can be manipulated by an individual reviewer via a compass 82, which enables the reviewer to navigate, explore, and search the clusters 83 and spines 86 appearing within the compass 82, as further described in commonly-assigned U.S. Pat. No. 7,356,777, the disclosure of which is incorporated by reference. Visually, the compass 82 emphasizes clusters 83 located within the compass 82, while deemphasizing clusters 83 appearing outside of the compass 82.

Spine labels 89 appear outside of the compass 82 at an end of each cluster spine 86 to connect the outermost cluster of the cluster spine 86 to the closest point along the periphery of the compass 82. In one embodiment, the spine labels 89 are placed without overlap and circumferentially around the compass 82. Each spine label 89 corresponds to one or more concepts that most closely describe the cluster spines 86 appearing within the compass 82. Additionally, the cluster concepts for each of the spine labels 89 can appear in a concepts list (not shown) also provided in the display. Toolbar buttons 87 located at the top of the display 81 enable a user to execute specific commands for the composition of the spine groups displayed. A set of pull down menus 88 provides further control over the placement and manipulation of clusters 83 and cluster spines 86 within the display 81. Other types of controls and functions are possible.

A document guide 90 can be placed in the display 81. The document guide 90 can include a “Selected” field, a “Search Results” field, and details regarding the numbers of uncoded documents and reference documents provided in the display. The number of uncoded documents includes all uncoded documents within a corpus of documents for a review project or within an assignment for the project. The number of reference documents includes the total number of reference documents selected for injection into the cluster set. The “Selected” field in the document guide 90 provides a number of documents within one or more clusters selected by the reviewer. The reviewer can select a cluster by “double clicking” the visual representation of that cluster using a mouse. The “Search Results” field provides a number of uncoded documents and reference documents that include a particular search term identified by the reviewer in a search query box 92.

In one embodiment, a garbage can 91 is provided to remove tokens, such as cluster concepts, from consideration in the current set of clusters 83. Removed cluster concepts prevent those concepts from affecting future clustering, as may occur when a reviewer considers a concept irrelevant to the clusters 83.

The display 81 provides a visual representation of the relationships between thematically related documents, including uncoded documents and injected reference documents. The uncoded documents and injected reference documents located within a cluster or spine can be compared based on characteristics, such as the assigned classification codes of the reference documents, a number of reference documents associated with each classification code, and a number of different classification codes, to identify relationships between the uncoded documents and injected reference documents. The reviewer can use the displayed relationships as suggestions for classifying the uncoded documents. For example, FIG. 6A is a block diagram showing, by way of example, a cluster 93 with “privileged” reference documents 95 and uncoded documents 94. The cluster 93 includes nine uncoded documents 94 and three reference documents 95. The three reference documents 95 are each classified as “privileged.” Accordingly, based on the number of “privileged” reference documents 95 present in the cluster 93, the absence of other classifications of reference documents, and the thematic relationship between the uncoded documents 94 and the “privileged” reference documents 95, the reviewer may be more inclined to review the uncoded documents 94 in that cluster 93 or to classify one or more of the uncoded documents 94 as “privileged,” without review.

Alternatively, the three reference documents can be classified as “non-responsive,” instead of “privileged” as in the previous example. FIG. 6B is a block diagram showing, by way of example, a cluster 96 with “non-responsive” reference documents 97 and uncoded documents 94. The cluster 96 includes nine uncoded documents 94 and three “non-responsive” documents 97. Since the uncoded documents 94 in the cluster are thematically related to the “non-responsive” reference documents 97, the reviewer may wish to assign a “non-responsive” code to one or more uncoded documents 94 without review, as they are most likely not relevant to the legal matter associated with the document review project. In making a decision to assign a code, such as “non-responsive,” the reviewer can consider the number of “non-responsive” reference documents, the presence or absence of other reference document classification codes, and the thematic relationship between the “non-responsive” reference documents and the uncoded documents. Thus, the presence of three “non-responsive” reference documents 97 in the cluster of uncoded documents provides a suggestion that the uncoded documents 94 may also be “non-responsive.” Further, the label 89 associated with the spine 86 upon which the cluster 96 is located can be used to influence a suggestion.

A further example can include a combination of “privileged” and “non-responsive” reference documents. For example, FIG. 6C is a block diagram showing, by way of example, a cluster 98 with uncoded documents 94 and a combination of reference documents 95, 97. The cluster 98 can include one “privileged” reference document 95, two “non-responsive” documents 97, and nine uncoded documents 94. The “privileged” 95 and “non-responsive” 97 reference documents can be distinguished by different colors or shapes, as well as other identifiers for the circle. The combination of “privileged” 95 and “non-responsive” 97 reference documents within the cluster 98 can suggest to a reviewer that the uncoded documents 94 should be reviewed before classification or that one or more uncoded documents 94 should be classified as “non-responsive” based on the higher number of “non-responsive” reference documents 97 in the cluster 98. In making a classification decision, the reviewer may consider the number of “privileged” reference documents 95 versus the number of “non-responsive” reference documents 97, as well as the thematic relationships between the uncoded documents 94 and the “privileged” 95 and “non-responsive” 97 reference documents. Additionally, the reviewer can identify the closest reference document to an uncoded document and assign the classification code of the closest reference document to the uncoded document. Other examples, classification codes, and combinations of classification codes are possible.
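As a non-limiting sketch, the factors weighed in the examples of FIGS. 6A-6C, namely the count of reference documents for each classification code and the presence or absence of each code, could be tabulated for a cluster as shown below. The function name cluster_hint, the returned field names, and the “mixed” flag signaling that manual review may be warranted are assumptions introduced for this sketch.

```python
from collections import Counter


def cluster_hint(reference_codes, all_codes):
    """Summarize the reference documents injected into one cluster.

    reference_codes: classification codes of the injected reference documents.
    all_codes: every classification code in use for the review project.
    """
    counts = Counter(reference_codes)
    present = [code for code in all_codes if counts[code] > 0]
    absent = [code for code in all_codes if counts[code] == 0]
    top = max(present, key=lambda code: counts[code]) if present else None
    return {
        "counts": {code: counts[code] for code in present},
        "present": present,
        "absent": absent,
        "top": top,                  # code with the highest count, if any
        "mixed": len(present) > 1,   # assumed signal that review may be warranted
    }


codes = ["privileged", "responsive", "non-responsive"]
# Cluster with three "privileged" reference documents, as in FIG. 6A.
hint_6a = cluster_hint(["privileged"] * 3, codes)
# Cluster with one "privileged" and two "non-responsive", as in FIG. 6C.
hint_6c = cluster_hint(["privileged", "non-responsive", "non-responsive"], codes)
```

For the cluster of FIG. 6C, the sketch reports “non-responsive” as the code with the highest count and flags the cluster as mixed, which corresponds to the two alternatives left to the reviewer in that example.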

Additionally, the reference documents can also provide suggestions for classifying clusters and spines. The suggestions provided for classifying a cluster can include factors, such as a presence or absence of classified documents with different classification codes within the cluster and a quantity of the classified documents associated with each classification code in the cluster. The classified documents can include reference documents and newly classified uncoded documents. The classification code assigned to the cluster is representative of the documents in that cluster and can be the same as or different from one or more classified documents within the cluster. Further, the suggestions provided for classifying a spine include factors, such as a presence or absence of classified documents with different classification codes within the clusters located along the spine and a quantity of the classified documents for each classification code. Other suggestions for classifying documents, clusters, and spines are possible.

Classifying Uncoded Documents

The display of relationships between the uncoded documents and reference documents provides suggestions to an individual reviewer. The suggestions can indicate a need for manual review of the uncoded documents, when review may be unnecessary, and hints for classifying the uncoded documents. Additional information can be provided to assist the reviewer in making classification decisions for the uncoded documents, such as a machine-generated confidence level associated with a suggested classification code, as described in commonly-assigned U.S. Pat. No. 8,632,223, issued Jan. 21, 2014, the disclosure of which is incorporated by reference.

The machine-generated suggestion for classification and associated confidence level can be determined by a classifier. FIG. 7 is a process flow diagram 100 showing, by way of example, a method for classifying uncoded documents using a classifier for use in the method of FIG. 2. An uncoded document is selected from a cluster within a cluster set (block 101) and compared to a neighborhood of x-reference documents (block 102), also located within the cluster, to identify those reference documents in the neighborhood that are most relevant to the selected uncoded document. In a further embodiment, a machine-generated suggestion for classification and an associated confidence level can be provided for a cluster or spine by selecting and comparing the cluster or spine to a neighborhood of x-reference documents determined for the selected cluster or spine, as further discussed below.

The neighborhood of x-reference documents is determined separately for each selected uncoded document and can include one or more injected reference documents within that cluster. During neighborhood generation, the x-number of reference documents in a neighborhood can first be determined automatically or by an individual reviewer. Next, the x-number of reference documents nearest in distance to the selected uncoded document are identified. Finally, the identified x-number of reference documents are provided as the neighborhood for the selected uncoded document. In a further embodiment, the x-number of reference documents are defined for each classification code, rather than across all classification codes. Once generated, the x-number of reference documents in the neighborhood and the selected uncoded document are analyzed by the classifier to provide a machine-generated classification suggestion (block 103). A confidence level for the suggested classification is also provided (block 104).
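The neighborhood generation described above can be illustrated, by way of example only, with the following sketch. The distance is assumed here to be one minus the cosine of the score vectors, the injected reference documents are assumed to be supplied as (identifier, classification code, score vector) tuples, and the function names cosine_distance and build_neighborhood and the per_code flag are hypothetical.

```python
import math


def cosine_distance(a, b):
    """Assumed distance metric: 1 - cosine of two sparse score vectors."""
    dot = sum(w * b.get(c, 0.0) for c, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def build_neighborhood(uncoded_vec, injected_refs, x, per_code=False):
    """Return the nearest reference documents as (distance, ref_id, code) tuples.

    With per_code=False, the x nearest documents across all codes are kept.
    With per_code=True, up to x nearest documents are kept for each code.
    """
    ranked = sorted(
        ((cosine_distance(uncoded_vec, vec), rid, code) for rid, code, vec in injected_refs),
        key=lambda item: (item[0], item[1]),
    )
    if not per_code:
        return ranked[:x]
    kept, seen = [], {}
    for item in ranked:
        code = item[2]
        if seen.get(code, 0) < x:
            kept.append(item)
            seen[code] = seen.get(code, 0) + 1
    return kept


uncoded = {"a": 1.0}
refs = [
    ("r1", "privileged", {"a": 1.0}),
    ("r2", "non-responsive", {"a": 1.0, "b": 1.0}),
    ("r3", "non-responsive", {"b": 1.0}),
    ("r4", "privileged", {"a": 1.0, "b": 3.0}),
]
nearest_overall = build_neighborhood(uncoded, refs, x=1)
nearest_per_code = build_neighborhood(uncoded, refs, x=1, per_code=True)
```

With x set to one, the first call keeps only the single nearest reference document, while the per-code variant keeps the nearest reference document for each of the two classification codes.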

The analysis of the selected uncoded document and x-number of reference documents can be based on one or more routines performed by the classifier, such as a nearest neighbor (NN) classifier. The routines for determining a suggested classification code for an uncoded document include a minimum distance classification measure, also known as closest neighbor, minimum average distance classification measure, maximum count classification measure, and distance weighted maximum count classification measure. The minimum distance classification measure includes identifying a neighbor that is the closest distance to the selected uncoded document and assigning the classification code of the closest neighbor as the suggested classification code for the selected uncoded document. The closest neighbor is determined by comparing score vectors for the selected uncoded document with each of the x-number reference documents in the neighborhood as the cos σ to determine a distance metric. The distance metrics for the x-number of reference documents are compared to identify the reference document closest to the selected uncoded document as the closest neighbor.

The minimum average distance classification measure includes calculating an average distance of the reference documents in a cluster for each classification code. The classification code of the reference documents having the closest average distance to the selected uncoded document is assigned as the suggested classification code. The maximum count classification measure, also known as the voting classification measure, includes counting a number of reference documents within the cluster for each classification code and assigning a count or “vote” to the reference documents based on the assigned classification code. The classification code with the highest number of reference documents or “votes” is assigned to the selected uncoded document as the suggested classification. The distance weighted maximum count classification measure includes identifying a count of all reference documents within the cluster for each classification code and determining a distance between the selected uncoded document and each of the reference documents. Each count assigned to the reference documents is weighted based on the distance of the reference document from the selected uncoded document. The classification code with the highest count, after consideration of the weight, is assigned to the selected uncoded document as the suggested classification.
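The four classification measures described above can be sketched, as one possible illustration, over a neighborhood supplied as (distance, classification code) pairs. The inverse-distance weighting used in the last routine is an assumption, since no particular weighting is specified, and all function names are hypothetical.

```python
from collections import defaultdict


def minimum_distance(neighbors):
    """Closest neighbor: code of the single nearest reference document."""
    return min(neighbors, key=lambda n: n[0])[1]


def minimum_average_distance(neighbors):
    """Code whose reference documents have the smallest mean distance."""
    by_code = defaultdict(list)
    for distance, code in neighbors:
        by_code[code].append(distance)
    return min(by_code, key=lambda c: sum(by_code[c]) / len(by_code[c]))


def maximum_count(neighbors):
    """Voting: code assigned to the most reference documents."""
    votes = defaultdict(int)
    for _, code in neighbors:
        votes[code] += 1
    return max(votes, key=votes.get)


def distance_weighted_maximum_count(neighbors, eps=1e-9):
    """Voting with each vote weighted by 1 / (distance + eps), an assumed weighting."""
    votes = defaultdict(float)
    for distance, code in neighbors:
        votes[code] += 1.0 / (distance + eps)
    return max(votes, key=votes.get)


# One near "privileged" neighbor and two farther "non-responsive" neighbors.
neighbors = [(0.1, "privileged"), (0.5, "non-responsive"), (0.6, "non-responsive")]
suggestions = {
    "minimum_distance": minimum_distance(neighbors),
    "minimum_average_distance": minimum_average_distance(neighbors),
    "maximum_count": maximum_count(neighbors),
    "distance_weighted_maximum_count": distance_weighted_maximum_count(neighbors),
}
```

In this example only the maximum count measure suggests “non-responsive,” because the single “privileged” reference document is much closer to the selected uncoded document than either “non-responsive” reference document.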

The x-NN classifier provides the machine-generated classification code with a confidence level that can be presented as an absolute value or percentage. Other confidence level measures are possible. The reviewer can use the suggested classification code and confidence level to assign a classification to the selected uncoded document. Alternatively, the x-NN classifier can automatically assign the suggested classification. In one embodiment, the x-NN classifier only assigns an uncoded document with the suggested classification code if the confidence level is above a threshold value, which can be set by the reviewer or the x-NN classifier.
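A minimal sketch of the threshold-gated automatic assignment described above follows. The confidence level is assumed here to be the share of neighborhood votes held by the suggested classification code, which is only one possible measure, and the function names suggest_with_confidence and auto_assign are hypothetical.

```python
from collections import Counter


def suggest_with_confidence(neighbor_codes):
    """Return (suggested code, confidence), with confidence as a vote share in [0, 1]."""
    if not neighbor_codes:
        return None, 0.0
    counts = Counter(neighbor_codes)
    code, votes = counts.most_common(1)[0]
    return code, votes / len(neighbor_codes)


def auto_assign(neighbor_codes, threshold):
    """Assign the suggested code only if its confidence is above the threshold."""
    code, confidence = suggest_with_confidence(neighbor_codes)
    return code if confidence > threshold else None


neighborhood_codes = ["non-responsive", "non-responsive", "privileged"]
suggested, confidence = suggest_with_confidence(neighborhood_codes)
assigned_low_bar = auto_assign(neighborhood_codes, threshold=0.6)
assigned_high_bar = auto_assign(neighborhood_codes, threshold=0.7)
```

A return value of None from auto_assign indicates that the suggested classification code is left for the reviewer to accept or reject.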

As briefly described above, classification can also occur on a cluster or spine level. For instance, for cluster classification, a cluster is selected and a score vector for the center of the cluster is determined as described above with reference to FIG. 3. A neighborhood for the selected cluster is determined based on a distance metric. The x-number of reference documents that are closest to the cluster center can be selected for inclusion in the neighborhood, as described above. Each reference document in the selected cluster is associated with a score vector and the distance is determined by comparing the score vector of the cluster center with the score vector of each reference document to determine an x-number of reference documents that are closest to the cluster center. However, other methods for generating a neighborhood are possible. Once determined, one of the classification measures is applied to the neighborhood to determine a suggested classification code and confidence level for the selected cluster.

Throughout the process of identifying similar reference documents and injecting the reference documents into a cluster to provide a classification suggestion, the reviewer can retain control over many aspects, such as a source of the reference documents and a number of similar reference documents to be selected. FIG. 8 is a screenshot 110 showing, by way of example, a reference options dialogue box 111 for entering user preferences for reference document injection. The dialogue box 111 can be accessed via a pull-down menu as described above with respect to FIG. 5. Within the dialogue box 111, the reviewer can utilize user-selectable parameters to define a source of reference documents 112, filter the reference documents by category 113, select a target for the reference documents 114, select an action to be performed upon the reference documents 115, define timing of the injection 116, define a count of similar reference documents to be injected into a cluster 117, select a location of injection within a cluster 118, and compile a list of injection commands 119. Each user-selectable option can include a text box for entry of a user preference or a drop-down menu with predetermined options for selection by a reviewer. Other user-selectable options and displays are possible.

The reference source parameter 112 allows the reviewer to identify one or more sources of the reference documents. The sources can include all previously classified reference documents in a document review project, all reference documents for which the associated classification has been verified, all reference documents that have been analyzed, or all reference documents in a particular binder. The binder can include categories of reference documents, such as reference documents that are particular to the document review project or that are related to a prior document review project. The category filter parameter 113 allows the reviewer to generate and display the set of reference documents using only those reference documents associated with a particular classification code. The target parameter 114 allows the reviewer to select a target for injection of the similar reference documents. Options available for the target parameter 114 can include an assignment, all clusters, select clusters, all spines, select spines, all documents, and select documents. The assignment can be represented as a cluster set; however, other representations are possible, including a file hierarchy and a list of documents, such as an email folder, as described in commonly-assigned U.S. Pat. No. 7,404,151, the disclosure of which is incorporated by reference.

The action parameter 115 allows the reviewer to define display options for the injected reference documents. The display options can include injecting the similar reference documents into a map display of the clusters, displaying the similar reference documents in the map until reclustering occurs, displaying the injected reference documents in the map, and not displaying the injected reference documents in the map. Using the automatic parameter 116, the reviewer can define a time for injection of the similar reference documents. The timing options can include injecting the similar reference documents upon opening of an assignment, upon reclustering, or upon changing the selection of the target. The reviewer can specify a threshold number of similar reference documents to be injected in each cluster or spine via the similarity option 117. The number selected by a reviewer is an upper threshold since a lesser number of similar reference documents may be identified for injecting into a cluster or spine. Additionally, the reviewer can use the similarity option 117 to set a value for determining whether a reference document is sufficiently similar to the uncoded documents.
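The two settings of the similarity option 117 described above, an upper threshold on the number of injected reference documents and a value for sufficient similarity, could interact as in the following sketch. The function name select_for_injection and its parameter names are hypothetical, and the similarity scores are assumed to have been computed beforehand.

```python
def select_for_injection(scored_refs, max_count, min_similarity):
    """Pick reference documents to inject into one cluster or spine.

    scored_refs: list of (ref_id, similarity) pairs.
    max_count: upper threshold on the number of documents injected.
    min_similarity: value a document must meet to count as sufficiently similar.
    """
    eligible = [(rid, s) for rid, s in scored_refs if s >= min_similarity]
    eligible.sort(key=lambda pair: pair[1], reverse=True)
    return [rid for rid, _ in eligible[:max_count]]


scored = [("r1", 0.9), ("r2", 0.4), ("r3", 0.7)]
# Fewer than max_count documents are injected when few are similar enough.
injected_up_to_five = select_for_injection(scored, max_count=5, min_similarity=0.5)
injected_up_to_one = select_for_injection(scored, max_count=1, min_similarity=0.5)
```

The first call returns only two reference documents even though up to five are permitted, which reflects the treatment of the reviewer-selected number as an upper threshold.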

Further, the reviewer can select a location within the cluster for injection of the similar reference documents via the cluster site parameter 118. Options for cluster site injection can include the cluster centroid. Other cluster sites are possible. The user-selectable options for each preference can be compiled as a list of injection commands 119 for use in the injection process. Other user selectable parameters, options, and actions are possible.

The clustering of uncoded documents and injection of similar reference documents in the clusters has been described in relation to documents; however, in a further embodiment, the cluster and injection process can be applied to tokens. For example, uncoded tokens are clustered and similar reference tokens are injected into the clusters and displayed to provide classification suggestions based on relationships between the uncoded tokens and similar reference tokens. The uncoded documents can then be classified based on the classified tokens. In one embodiment, the tokens include concepts, n-grams, raw terms, and entities.

While the invention has been particularly shown and described as referenced to the embodiments thereof, those skilled in the art will understand that the foregoing and other changes in form and detail may be made therein without departing from the spirit and scope.

What is claimed is:
 1. A system for providing document classification suggestions, comprising: a database to store clusters of uncoded documents; and a server comprising a central processing unit, memory, an input port to receive the clusters from the database, and an output port, wherein the central processing unit is configured to: compare one or more of the uncoded documents in one such cluster to a set of reference documents each assigned with a classification code; identify those reference documents in the set that are similar to the one or more uncoded documents; identify different types of the classification codes for at least a portion of the similar reference documents and obtaining a count of the classification codes assigned to the portion of similar reference documents for each classification code type; and provide a suggestion for classification of at least one of the one or more uncoded documents based on the count of classification codes for each classification type and one of a presence and absence of each classification code type.
 2. A system according to claim 1, further comprising: a selection module to select the classification type that is present and has a highest count of classification codes as the suggested classification.
 3. A system according to claim 1, further comprising: a cluster classification module to classify the cluster of the one or more uncoded documents, comprising: a counter module to obtain a count of classification codes for each of the classification types based on the similar reference documents and the uncoded documents with assigned classification codes in the cluster; and an assignment module to assign to the cluster the classification code of the classification type with a highest count of classification codes in the cluster.
 4. A system according to claim 1, further comprising: a display to provide the clusters along one or more spines based on a similarity of the uncoded documents.
 5. A system according to claim 4, further comprising: a determination module to determine the spine on which the cluster is placed; and a label module to identify a label associated with the identified spine, wherein the label is considered with the count of classification codes for each classification type and one of a presence and absence of each classification code type for the assignment of the classification code.
 6. A system according to claim 4, further comprising: a document identification module to identify the similar reference documents, comprising: a determination module to determine a measure for the spine on which the cluster is placed; a comparison module to compare the measure with each of the reference documents; and a selection module to select the reference documents most similar to the measure as the similar reference documents.
 7. A system according to claim 1, further comprising: a confidence module to determine a confidence level of the suggested classification and to provide the confidence level with the suggested classification.
 8. A system according to claim 1, further comprising: a document identification module to identify the similar reference documents, comprising: a determination module to determine a measure for the one or more uncoded documents in the cluster, wherein the measure comprises one of a center of the cluster, a sample of the uncoded documents, and the cluster center and the sample; a comparison module to compare the measure with each of the reference documents; and a selection module to select the reference documents most similar to the measure as the similar reference documents.
 9. A system according to claim 1, wherein the similar reference documents satisfy a threshold number.
 10. A system according to claim 1, further comprising: a receipt module to receive from a user, a location for injecting the similar reference documents into the cluster.
 11. A method for providing document classification suggestions, comprising: obtaining clusters of uncoded documents; comparing one or more of the uncoded documents in one such cluster to a set of reference documents each assigned with a classification code; identifying those reference documents in the set that are similar to the one or more uncoded documents; identifying different types of the classification codes for at least a portion of the similar reference documents and obtaining a count of the classification codes assigned to the portion of similar reference documents for each classification code type; and providing a suggestion for classification of at least one of the one or more uncoded documents based on the count of classification codes for each classification type and one of a presence and absence of each classification code type.
 12. A method according to claim 11, further comprising: selecting the classification type that is present and has a highest count of classification codes as the suggested classification.
 13. A method according to claim 11, further comprising: classifying the cluster with the one or more uncoded documents, comprising: obtaining a count of classification codes for each of the classification types based on the similar reference documents and the uncoded documents with assigned classification codes in the cluster; and assigning to the cluster the classification code of the classification type with a highest count of classification codes in the cluster.
 14. A method according to claim 11, further comprising: displaying the clusters along one or more spines based on a similarity of the uncoded documents.
 15. A method according to claim 14, further comprising: determining the spine on which the cluster is placed; and identifying a label associated with the identified spine, wherein the label is considered with the count of classification codes for each classification type and one of a presence and absence of each classification code type for the assignment of the classification code.
 16. A method according to claim 14, further comprising: identifying the similar reference documents, comprising: determining a measure for the spine on which the cluster is placed; comparing the measure with each of the reference documents; and selecting the reference documents most similar to the measure as the similar reference documents.
 17. A method according to claim 11, further comprising: determining a confidence level of the suggested classification; and providing the confidence level with the suggested classification.
 18. A method according to claim 11, further comprising: identifying the similar reference documents, comprising: determining a measure for the one or more uncoded documents in the cluster, wherein the measure comprises one of a center of the cluster, a sample of the uncoded documents, and the cluster center and the sample; comparing the measure with each of the reference documents; and selecting the reference documents most similar to the measure as the similar reference documents.
 19. A method according to claim 11, wherein the similar reference documents satisfy a threshold number.
 20. A method according to claim 11, further comprising: receiving from a user, a location for injecting the similar reference documents into the cluster. 